Abstract. We give primary-backup protocols for various models of failure.
These protocols are optimal with respect to degree of replication,
failover time, and response time to client requests.


1  Introduction 

One  way  to  implement  a  fault-tolerant  service  is  to  employ  multiple  sites  that 
fail  independently.  The  state  of  the  service  is  replicated  and  distributed  among 
these sites, and updates are coordinated so that even when a subset of the sites
fail,  the  service  remains  available. 

A  common  approach  to  structuring  such  replicated  services  is  to  designate 
one  site  as  the  primary  and  all  the  others  as  backups.  Clients  make  requests  by 
sending messages only to the primary. If the primary fails, then a failover occurs
and  one  of  the  backups  takes  over.  This  service  architecture  is  commonly  called 
the  primary-backup  or  the  primary-copy  approach  [1]. 

In [5] we give lower bounds for implementing primary-backup protocols under
various  models  of  failure.  These  lower  bounds  constrain  the  degree  of  replication, 
the  time  during  which  the  service  can  be  without  a  primary,  and  the  amount  of 
time  it  can  take  to  respond  to  a  client  request.  In  this  paper,  we  show  that  most 
of  these  lower  bounds  are  tight  by  giving  matching  protocols. 

Some of the protocols that we describe have surprising properties. In one
case, the optimal protocol is one in which a non-faulty primary is forced to
relinquish control to a backup that it knows to be faulty! However, the existence
of such a scenario is not peculiar to our protocol. As shown in [5], relinquishing
control to a faulty backup is indeed necessary to achieve optimal protocols in
some failure models. Another surprise is that in some protocols that achieve
optimal response time, the site that receives the request (i.e. the primary) is
not the site that sends the response to the clients. We show that this anomaly is
not idiosyncratic to our protocols: it is necessary for achieving optimal response
time.
Ames grant number NAG 2-593 and by grants from IBM and Siemens.
from IBM Endicott Programming Laboratory.


The rest of the paper is organized as follows. Section 2 gives a specification for
primary-backup protocols, Sect. 3 discusses our system model, Sect. 4 summarizes
the lower bounds from [5], and Sect. 5 summarizes our results. Sections 6,
7 and 8 describe the protocols that achieve our lower bounds, and Sect. 9 describes
a protocol in which the primary is forced to relinquish control to a faulty
backup. We conclude in Sect. 10. Due to lack of space, the description of some of
the protocols and all proofs are omitted from this paper. See [4] for a complete
description and proofs.

2  Specification  of  Primary-Backup  Services 

Our results apply to any protocol that satisfies the following four properties,
and many primary-backup protocols in the literature (e.g. [1,2,3]) do satisfy
this characterization.

Pb1: There exists a predicate Prmy_s on the state of each site s. At any time, there
is at most one site s whose state satisfies Prmy_s.

Pb2: Each client i maintains a site identity Dest_i such that to make a request,
client i sends a message (only) to Dest_i.

For  the  next  property,  we  model  a  communications  network  by  assuming  that 
client  requests  are  enqueued  in  a  message  queue  of  a  site. 

Pb3:  If  a  client  request  arrives  at  a  site  that  is  not  the  primary,  then  that  request 
is  not  enqueued  (and  is  therefore  not  processed  by  the  site). 

A request sent to a primary-backup service can be lost if it is sent to a faulty
primary. Periods during which requests are lost, however, are bounded by the
time required for a backup to take over as the new primary. Such behavior is
an instance of what we call bofo (bounded outage finitely often). We say that
an outage occurs at time t if some client makes a request at that time but does
not receive a response¹. A (k, Δ)-bofo server is one for which all outages can
be grouped into at most k periods, each period having duration of at most Δ².
The final property of the primary-backup protocols is that they implement a
bofo server (for some values of k and Δ).

Pb4: There exist fixed and bounded values k and Δ such that the service behaves
like a single (k, Δ)-bofo server.
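
The (k, Δ)-bofo condition can be made concrete with a small check. The following is a minimal sketch, assuming outages are given as a list of time instants; the function name `is_bofo` and this representation are illustrative and not part of the paper. It groups outages greedily into periods of length at most Δ and compares the number of periods with k.

```python
def is_bofo(outage_times, k, delta):
    """Return True if the outage instants fit into at most k periods,
    each of duration at most delta (greedy grouping from the left)."""
    periods = 0
    period_start = None
    for t in sorted(outage_times):
        # Open a new period when t falls outside the current one.
        if period_start is None or t - period_start > delta:
            periods += 1
            period_start = t
    return periods <= k
```

Starting each period at the earliest uncovered outage uses the fewest periods, so the check accepts exactly when some grouping into at most k periods exists.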

Clearly, Pb4 cannot be implemented if the number of failures is not bounded.
In particular, if all sites fail, then no service can be provided and so the service
is not (k, Δ)-bofo for any finite k and Δ.

1  For  simplicity,  we  assume  in  this  paper  that  every  request  elicits  a  response. 

2 Therefore, as well as being finite, the number of periods of service outage that can
occur is also bounded (by k).


3  The  Model 


Consider a system with n_s sites and n_c clients. Site clocks are assumed to be
perfectly synchronized with real time³. Clients and sites communicate through a
completely connected, point-to-point, FIFO network. Furthermore, if processes
(clients or sites) p_i and p_j are connected by a (nonfaulty) link, then we assume
that, for some a priori known δ, a message sent by p_i to p_j at time t arrives at
p_j at some time t' ∈ (t, t + δ].

We assume that all clients are non-faulty and consider the following types
of site and link failures: crash failures (faulty sites may halt prematurely; until
they halt, they behave correctly)⁴, crash+link failures (faulty sites may crash or
faulty links may lose messages), receive-omission failures (faulty sites may crash
or omit to receive some messages), send-omission failures (faulty sites may crash
or omit to send some messages), general-omission failures (faulty sites may fail
by send-omission, receive-omission, or both). Note that link failures and the
various types of omission failures are different only insofar as a message loss is
attributed to a different component. Link failures are masked by adding redundant
communication paths; omission failures are masked by adding redundant
sites. As we will see, the lower bounds for the two cases are different.

Let f be the maximum number of components that can be faulty (i.e. f is
the maximum number of faulty sites in the case of crash, send-omission,
receive-omission and general-omission failures, whereas f is the maximum number of
faulty sites and links in the case of crash+link failures).


4  Lower  Bounds 

In Tab. 1, we repeat the lower bounds from [5] for the degree of replication, the
blocking time and the failover time for the various kinds of failures. Informally,
a protocol is C-blocking if in all failure-free runs, the time that elapses from
the moment a site receives a request until a site sends the associated response
is bounded by C⁵. Failover time is defined to be the longest duration (over all
possible runs) for which there is no primary. However, the failover time bounds
only hold for protocols that satisfy the following additional (and reasonable)
property.

Pb5:  A  correct  site  that  is  the  primary  remains  so  until  there  is  a  failure. 


3 The protocols can be extended to the more realistic model in which clocks are only
approximately synchronized [7].

4 The lower bounds are also tight for fail-stop failures [10] except for the bound on
failover time.

5 We assume that it takes no time for a site to compute the response to a request.


Table 1. Lower Bounds: Degree of Replication, Blocking Time and Failover Time

Failure type       Replication     Blocking time (C)            Failover time

Crash              n_s > f         0                            fδ

Crash+Link         n_s > f + 1     0                            2fδ

Send-Omission      n_s > f         δ if f = 1                   2fδ
                                   2δ if f > 1

Receive-Omission   n_s > ⌊3f/2⌋    δ if n_s ≤ 2f and f = 1      2fδ
                                   2δ if n_s ≤ 2f and f > 1
                                   0 if n_s > 2f

General-Omission   n_s > 2f        δ if f = 1                   2fδ
                                   2δ if f > 1
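
The replication and failover-time columns of Table 1 can be evaluated mechanically. The following is a minimal sketch, assuming the bounds as read from the table; the function names `min_sites` and `failover_lower_bound` and the string keys are illustrative choices, not notation from the paper.

```python
def min_sites(failure: str, f: int) -> int:
    """Smallest n_s permitted by the replication lower bound n_s > bound."""
    strict_bound = {
        "crash": f,
        "crash+link": f + 1,
        "send-omission": f,
        "receive-omission": (3 * f) // 2,   # floor(3f/2)
        "general-omission": 2 * f,
    }[failure]
    return strict_bound + 1


def failover_lower_bound(failure: str, f: int, delta: float) -> float:
    """Failover-time lower bound: f*delta for crash, 2*f*delta otherwise."""
    return f * delta if failure == "crash" else 2 * f * delta
```

For example, with f = 1 the receive-omission bound admits two sites, which is the configuration used by the protocol of Sect. 9.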

5  Summary  of  Results 

We first present a primary-backup protocol schema that will be used to derive
the protocols for all the failure models. This schema is based on the properties of
two key primitives, broadcast and deliver, that sites use to exchange messages.
We show that the schema satisfies Pb1-Pb5 by only using these properties,
independent of the particular failure model. Each failure model (crash, crash+link,
send-omission, receive-omission and general-omission) is handled with a different
implementation of broadcast and deliver, and in all but one case optimal
protocols are constructed.

The protocols for crash and crash+link failures show that all the corresponding
lower bounds are tight. The protocol for general-omission failures uses a
translation technique similar to [8], and demonstrates that our lower bounds for
general-omission failures are tight, except for the bound on blocking time when
f = 1. However, for this special case we have derived a different protocol (not
described in this paper) having optimal blocking time. In all failure-free runs of
this protocol, the site that receives the request (i.e. the primary) is not the site
that sends the response to the client. We show in this paper that this behavior
is necessary.

We do not show the protocols for send-omission and receive-omission failures
in this paper because they are similar to the protocol for general-omission
failures. These protocols establish that the bounds for send-omission failures
are tight. For receive-omission failures, the lower bound on blocking time when
n_s > 2f and the lower bound on failover time are also tight. However, our protocol
does not have optimal replication, as it requires n_s > 2f (rather than
n_s > ⌊3f/2⌋).

Finally, in [5] we proved that all receive-omission protocols having
⌊3f/2⌋ < n_s ≤ 2f necessarily exhibit a scenario in which a non-faulty primary is
forced to relinquish control to a faulty backup. In Sect. 9, we describe such a
protocol: it uses two sites and tolerates a single receive-omission failure. In
addition, this protocol is δ-blocking and so it demonstrates that our lower bound
on blocking time is tight for n_s ≤ 2f and f = 1. As in the protocol for general
omission when f = 1, it is the backup that sends responses to clients. This
behavior is shown to be necessary for an important class of protocols.

6  Protocols for the Clients and the (k, Δ)-bofo server

Property Pb4 requires that the primary-backup service behave like some (k, Δ)-bofo
server. Figure 1 gives such a canonical (k, Δ)-bofo server (say s), and Fig. 2
gives the protocol for client i interacting with s. As with any other bofo server,
a client will not receive the response to a request if either the request to s or the
response from s is lost.


initialize()
cobegin
  || inform-clients(“Dest = s”)
  || do forever
       when received request from client c
         response := Π(state, request)
         state := state ∘ response
         send response to client c
     od
coend

procedure initialize()
  state := ε

procedure inform-clients(ic)
  send (ic) to all clients

Fig. 1. Protocol run by a single (k, Δ)-bofo server s
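
The server loop of Fig. 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the state is modelled as the list of responses sent so far, `BofoServer` and `numbered_echo` are hypothetical names, and `numbered_echo` is only a stand-in for the response function Π, which the paper leaves abstract.

```python
from typing import Callable, List


class BofoServer:
    """Sketch of the request loop in Fig. 1."""

    def __init__(self, pi: Callable[[List[str], str], str]) -> None:
        self.pi = pi                  # stand-in for the response function Π
        self.state: List[str] = []    # initialize(): state is the empty sequence

    def handle(self, request: str) -> str:
        response = self.pi(self.state, request)   # response := Π(state, request)
        self.state.append(response)               # state := state ∘ response
        return response                           # sent back to the client


def numbered_echo(state: List[str], request: str) -> str:
    """Hypothetical Π: tag each request with its position in the history."""
    return f"{len(state)}:{request}"
```

Because each response depends on the accumulated state, a backup that logs the same responses in the same order can take over and continue the sequence.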


In  Fig.  2,  response-time  corresponds  to  the  amount  of  time  the  client  has  to 
wait  in  order  to  get  the  response  from  s,  which  is  just  the  round  trip  message 
delay.  The  exact  value  for  response-time  depends  on  the  failure  model  being 
assumed. 

7  The  Primary-Backup  Protocol  Schema 

We first make the simplifying assumption that the links between the clients and
the  sites  are  non-faulty  and  there  are  no  omission  failures  between  the  clients  and 
the  sites  (i.e.  only  the  links  between  sites  can  be  faulty  for  crash+link  failures, 


cobegin
  || do forever
       if received “Dest = s” then
         Dest_i := s
     od
  || do forever
       if want to send request then
         send request to Dest_i
         if not received response by response-time then
           recover()  /* call some recovery procedure, which might retry */
         else

     od
coend

Fig. 2. Protocol run by client i interacting with server s


and  omission  failures  can  occur  only  between  sites  for  omission  failures).  We 
show  in  Sect.  7.1  how  this  assumption  can  be  removed. 

In order to emulate the server s (and consequently satisfy property Pb4), our
primary-backup protocol consists of n_s sites {s_0, ..., s_{n_s-1}}, each of which
runs the protocol in Fig. 3. The protocol for the clients remains the same.


initialize(i) 

cobegin 

||  if  i  =  0  then  primary(i)  else  backup(i) 

||  delivery-process(i) 

||  failure-detector(i) 
coend 

Fig. 3. Protocol run by site s_i to emulate server s


The procedures primary and backup (shown in Fig. 4) are the same for all
the failure models. On the other hand, the implementation of the procedures
initialize, broadcast (used in Fig. 4), delivery-process and failure-detector
changes depending on the particular failure model. However, we ensure that these
different implementations always satisfy a set of properties, called B1-B11 below.
We extracted these properties in order to make our proofs modular. In
particular we proved that, independent of the failure model, the protocol in
Figs. 3 and 4 satisfies Pb1-Pb5, as long as the remaining procedures satisfy
B1-B11. As a result, we could then prove Pb1-Pb5 for any other failure model
by just ensuring that the implementation of broadcast, delivery-process and
failure-detector for that failure model satisfied B1-B11.


procedure primary(j)
  cobegin
    || inform-clients(“Dest = s_j”)
    || broadcast((mylastlog, s_j, last(state_j)), j)  /* to all sites */
       do forever
         when received request from client c
           response := Π(state_j, request)
           state_j := state_j ∘ response
           broadcast((log, s_j, response), j)
           send response to client c
       od
  coend

procedure backup(k)
  do forever
    ((tag, s_j, r), j) := Deq(Rqueue_k)
    /* assume that dequeueing an empty queue
       does not return any sensible value of tag */

    /* synchronizing with the new primary */
    if tag = mylastlog then
      if r ∈ state_k then
        if r = last(state_k) then skip
        else state_k := state_k \ last(state_k)
      else state_k := state_k ∘ r

    /* logging response from primary */
    if tag = log then state_k := state_k ∘ r

    /* becoming the primary */
    if ∀i < k : Faulty_k[s_i] then primary(k)
  od

Fig. 4. The procedures primary and backup
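
The state handling in the backup procedure of Fig. 4 can be sketched as a pure function. This is a minimal illustration, assuming the state is a list of responses and tags are plain strings; the name `apply_message` is hypothetical, and queueing, failure detection and the transition to primary are left out.

```python
from typing import List


def apply_message(state: List[str], tag: str, r: str) -> List[str]:
    """Return the backup's new state after dequeuing a message (tag, s_j, r)."""
    if tag == "mylastlog":
        # Synchronizing with the new primary, whose last response is r.
        if r in state:
            if state[-1] == r:
                return state           # already consistent: skip
            return state[:-1]          # drop the response the new primary never saw
        return state + [r]             # add the response this backup missed
    if tag == "log":
        return state + [r]             # logging a response from the primary
    return state                       # no sensible tag (e.g. empty queue)
```

The three mylastlog cases correspond to a backup that is in step with the new primary, one log entry ahead of it, or one log entry behind it.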


We now give the properties B1-B11. In these properties, d, C and τ are some
constants whose values depend on the failure model. Intuitively, d corresponds
to the amount of time that can elapse from the time a message is broadcast to
the time it is dequeued by the receiver, C corresponds to the blocking time and
τ corresponds to the interval between successive “I am alive” messages that sites
send to each other (as we will see in the implementation of failure-detector).


When we say that a site “halts”, we mean that either the site has crashed or
has stopped executing the protocol by executing a stop. The array of booleans
Faulty_k indicates which servers s_k believes have halted: Faulty_k[s_j] being true
implies s_k believes that s_j has halted. Finally, we define a broadcast by a site to
be successful if the site does not halt during the execution of broadcast.

The  properties  can  be  subdivided  according  to  the  procedures  to  which  they 
relate: 

Properties of broadcast and delivery-process:

B1: If s_j initiates a broadcast b' after broadcast b, then no site dequeues b'
before b.

B2: If s_j initiates a broadcast b at time t, then no site dequeues b after time
t + d.

B3: If s_j initiates a broadcast at time t and does not halt by time t + C, then
the broadcast is successful. Furthermore, no broadcast takes longer than C
to complete.

Properties of failure-detector:

B4: If Faulty_j[s_k] becomes true, then it continues to be true, unless s_j halts.

B5: The value of Faulty_j[s_k] can only change at time t = iτ + d for some integer
i ≥ 0.

B6: If Faulty_j[s_k] = true at time t then s_k has halted by time t.

B7: If s_j has not halted by time t_1, and s_i, i < j has halted by time t_2 where
t_1 = t_2 + τ + d, then Faulty_j[s_i] = true by time t_1.

Properties of broadcast and delivery-process interacting with failure-detector:

B8: No correct site halts in procedures initialize, broadcast, delivery-process
or failure-detector.

B9: If s_j initiates a successful broadcast at time t, then for all non-halted sites
s_k, k > j, Faulty_k[s_j] = false through time ⌊t/τ⌋τ + d.

B10: If s_j initiates a successful broadcast b, then for every non-halted site s_k:
(Faulty_k[s_j] = true) ⇒ (s_k has dequeued b).

B11: If s_j initiates a broadcast b at time t and s_k, k > j broadcasts b', then
either no site dequeues b after b', or Faulty_k[s_j] = false through time t + d.


7.1  Outline  of  the  Proof  of  Correctness 

We now informally argue that the protocol in Figs. 3 and 4 satisfies Pb1-Pb5 as
long as the procedures initialize, broadcast, delivery-process and
failure-detector satisfy B1-B11.

Define: Prmy_j at time t ≡ s_j has not halted by time t
        ∧ ∀k < j : Faulty_j[s_k] = true at time t.
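
The predicate defined above can be written directly as a function. This is a minimal sketch, assuming sites are indexed from 0 and that `halted` and `faulty_j` are plain boolean lists; these names and the representation are illustrative only.

```python
from typing import List


def is_primary(j: int, halted: List[bool], faulty_j: List[bool]) -> bool:
    """Prmy_j: s_j has not halted and s_j believes every lower-indexed
    site has halted (faulty_j[k] plays the role of Faulty_j[s_k])."""
    return not halted[j] and all(faulty_j[k] for k in range(j))
```

For j = 0 the quantifier ranges over no sites, so s_0 is the primary for as long as it has not halted.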

From the above definition, Pb1 can now be seen from B6 and the backup
protocol in Fig. 4. Pb2 trivially follows from Fig. 2. Pb3 follows from Fig. 4 as
no request is sent to a site s_j before s_j becomes the primary. Also, Pb5 holds
(from B8 and Fig. 4) as a correct primary continues to be the primary. We now
show Pb4.

In order to show Pb4, we need to show two things: the state of the new
primary is consistent with the state of the old primary, and all outages are bounded.
We first show that the states are consistent.

Starting at the top of Fig. 4: when a site s_j becomes the primary, it first
informs the clients of its identity by calling inform-clients. For now, ignore
the broadcast of (mylastlog, s_j, -) by primary s_j.

Whenever s_j gets a request from a client, it computes the response, changes
state, broadcasts the log to the backups and sends the response back to the
client. It can be seen from Fig. 4 that if primary s_j sends a response r to the
client, then s_j must have executed a successful broadcast of (log, s_j, r). This fact
and properties B1, B2, B9 and B10 imply that (log, s_j, r) must also have been
dequeued by any backup s_k before s_k becomes the primary. Thus, the state of s_k
will continue to be consistent with the state of s_j iff the states were consistent
when s_j became the primary. We show this as follows.

Informally, the states of s_j and s_k could be inconsistent when s_j becomes
the primary for the following reason. Consider a scenario in which some primary
s_i crashes during the broadcast of (log, s_i, r) for some r. It is possible that s_k
received (log, s_i, r) and s_j did not. As a result, the states of s_j and s_k now differ.
It is for this reason that s_j broadcasts (mylastlog, s_j, r') where r' = last(state_j)
on becoming the primary. On receiving this, s_k sees that r' ≠ last(state_k) = r
and removes r from its state. As a result, state_j and state_k become equal.
Similarly, s_k would add r to its state had s_j, and not s_k, received (log, s_i, r).

In the scenario described in the last paragraph, response r is never sent to
the client (i.e. there is a service outage). We now show that such outages are
bounded. s_i did not send the response, and so by B3, must have halted by time
t (say). Now from B7 either s_{i+1} halts or becomes the primary by time t + τ + d.
Since no correct site halts (by B8 and Fig. 4), and the number of faulty sites is
bounded by f, there eventually will be a time when there is a correct primary
and no more outages occur.

From B3, the protocol is C-blocking. Furthermore, it can be shown from B7,
B8 and Fig. 4 that the failover time of the protocol is f(d + τ) for arbitrarily
small and positive τ.
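
As a small worked illustration of the failover-time expression f(d + τ), the sketch below evaluates it for chosen parameters. The function name `failover_time` and the sample values are assumptions made for illustration.

```python
def failover_time(f: int, d: float, tau: float) -> float:
    """Failover time f * (d + tau) of the protocol schema, for tau > 0."""
    return f * (d + tau)
```

With f = 2 and d = 1.0, shrinking τ from 0.25 to 0.125 moves the failover time from 2.5 to 2.25, approaching the value f·d as τ becomes arbitrarily small.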

However, the primary procedure in Fig. 4 does not work if there are message
losses between the clients and the sites (due to link or omission failures). For
example, a non-faulty primary might omit to receive all requests from a client
due to a failure, violating Pb4. Similarly, inform-clients might omit to inform
some of the clients. However, it is relatively easy to account for these failures
when clients are non-faulty. Assume that there is an upper bound (say G) between
any two requests from a client and that requests carry sequence numbers.
If the primary does not receive any requests from a client during an interval of
length G or if the primary receives some request with a sequence number gap,
then the primary halts. Similarly, the primary can detect that a response was
lost by having clients acknowledge responses. If such an acknowledgement is not
received, then again the primary halts. Properties Pb1-Pb5 can again be shown
to be true if we make the above modification in Figs. 2 and 4.
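
The request-loss check described above can be sketched as follows. This is a minimal illustration under stated assumptions: sequence numbers start at 0, the interval G is measured from time 0 for the first request, and `primary_should_halt` with its argument layout is a hypothetical name; acknowledgements of responses are not modelled.

```python
from typing import List, Tuple


def primary_should_halt(arrivals: List[Tuple[float, int]], G: float, now: float) -> bool:
    """arrivals holds (arrival time, sequence number) pairs from one client,
    in arrival order. Return True if the primary must halt at time `now`."""
    expected_seq = 0
    last_time = 0.0
    for time, seq in arrivals:
        # Halt on a silent interval longer than G or on a sequence number gap.
        if time - last_time > G or seq != expected_seq:
            return True
        expected_seq += 1
        last_time = time
    # Also halt if the client has been silent for longer than G up to now.
    return now - last_time > G
```

Halting on either symptom turns an omission between a client and the primary into a halt, which the backups then handle like any other primary failure.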

8  Implementation  for  the  various  Failure  Models 

In this section, we show how to implement B1-B11 for the various failure models.


8.1  Crash  Failures 

The procedures implementing B1-B11 for crash failures are given in Fig. 5.
Whenever we say that a site “delivered M”, we mean that the procedure deliver
has been called with M. Enq adds an element to the head of a queue and Deq
dequeues an element from the tail.


procedure initialize(k)
  state_k := Rqueue_k := ε
  ∀i : Faulty_k[s_i] := false

procedure broadcast(M, k)
  send M to all sites

procedure deliver(M, k)
  Let M be of the form (tag, s_j, -)
  if tag ∈ {log, mylastlog} then Enq(Rqueue_k, (M, k))

procedure delivery-process(k)
  do forever
    if received M then deliver(M, k)
  od

procedure failure-detector(k)
  cobegin
    || for i := 0 to ∞
         when current-time = iτ: send (alive, s_k, iτ) to all sites
    || for i := 0 to ∞
         when current-time = iτ + d:
           ∀j : if not delivered (alive, s_j, iτ) then Faulty_k[s_j] := true
  coend

Fig. 5. Procedures for crash failures
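
The timing of the failure detector in Fig. 5 can be illustrated with a short calculation. This is a minimal sketch, assuming that a site crashing at exactly time iτ does not send the alive message for round i and that every message sent is delivered within d; the name `detection_time` is hypothetical.

```python
def detection_time(crash_time: float, tau: float, d: float) -> float:
    """Time at which a correct site first sets Faulty[s_j] to true,
    given that s_j crashes at crash_time."""
    i = 0
    # Alive message i is sent at time i * tau, but only if s_j is still running.
    while i * tau < crash_time:
        i += 1
    # Round i is the first one with a missing alive message; it is checked at i*tau + d.
    return i * tau + d
```

For a crash at time 2.5 with τ = 1.0 and d = 0.5, the missing message is that of round 3 and detection happens at time 3.5, which is within the bound crash_time + τ + d stated by B7.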


We now informally argue that B1-B11 hold for this implementation if d = δ
and C = 0. B1 holds as channels are FIFO, and B2 holds as d = δ and the
maximum message delivery time is also δ. B3, B4 and B5 can be seen trivially.
B6 and B7 can be seen from failure-detector as there are no message losses
and message delivery time is at most δ. B8 holds trivially. It can be shown that
if s_j halts at time t, then no site s_k sets Faulty_k[s_j] to true before time t + δ.
B9, B10 and B11 now follow.

The procedures in Fig. 5 require n_s > f, and so the lower bound on the
degree of replication is tight. Since C = 0 and d = δ, from Sect. 7.1, the lower
bounds on blocking time and failover time are tight as well.

8.2  Crash+Link  Failures 

The procedures in Sect. 8.1 do not work if links can fail. For example, if s_j
sends a message to s_k then the message might not reach s_k due to a link failure
(which will violate B6 and B10). We therefore replace the implementation in
Fig. 5 with the one in Fig. 6, except that deliver is the same as before. For
this implementation, d = 2δ and C = 0. These procedures use fifo-broadcast
and fifo-deliver in Fig. 7 which ensure that intermittent link failures become
permanent failures: if s_j fifo-broadcasts a message m to s_k and s_k omits to
fifo-deliver m, then s_k will not fifo-deliver any subsequent message from s_j.


It can be shown (proof omitted) that this new implementation again satisfies
B1-B11 if n_s > f + 1. Informally, this is true for the following reason.
Whenever s_j initiates a broadcast of M at time t, it sends M to all sites, and the
sites then relay M to all other sites. Since n_s > f + 1, there is always at least one
non-faulty path between any two non-crashed sites, where a path consists of zero
or one intermediate sites. Therefore, if s_j does not crash during the broadcast,
then all non-crashed sites will deliver M by time t + 2δ. Furthermore, B1 will be
satisfied because of the FIFO properties of fifo-broadcast and fifo-deliver.

This crash+link protocol requires n_s > f + 1, is 0-blocking (since C = 0),
and has a failover time of f(2δ + τ) (since d = 2δ). Thus, all lower bounds for
crash+link failures are tight.


8.3  General-Omission  Failures 

The implementation of the procedures for general-omission failures is given in
Figs. 8 and 9, except delivery-process which is the same as Fig. 6. Whenever
we say that a site “fifo-delivered M”, we mean that the procedure fifo-deliver
was called with M. These procedures were developed using a technique similar
to [8] (although modified to work in our non-round-based model) which requires
n_s > 2f and d = 2δ.


procedure initialize(k)
  state_k := Rqueue_k := Dqueue_k := ε
  ∀i : Faulty_k[s_i] := false
  last-sent_k := ∀j : expected_k[j] := 0

procedure broadcast(M, k)
  time := current-time
  fifo-broadcast(init, M, s_k, time)

procedure delivery-process(k)
  cobegin
    || fifo-delivery-process(k)
    || do forever
         (tag, M, -, t) := Deq(Dqueue_k)
         if tag = init then fifo-broadcast(echo, M, s_k, t)
         if tag = echo and not dequeued (tag, M, -, t) before then deliver(M, k)
       od
  coend

procedure failure-detector(k)
  ∀i,j : A_j^i := (alive, s_j, iτ)
  cobegin
    || for i := 0 to ∞
         when current-time = iτ: fifo-broadcast(init, A_k^i, s_k, iτ)
    || for i := 0 to ∞
         when current-time = iτ + d:
           ∀j : if not delivered A_j^i then Faulty_k[s_j] := true
  coend

Fig. 6. Procedures for crash+link failures


procedure fifo-broadcast(tag, M, s_k, t)
  send (tag, M, s_k, t, last-sent_k) to all
  last-sent_k := last-sent_k + 1

procedure fifo-deliver(tag, M, s_j, t)
  Enq(Dqueue_k, (tag, M, s_j, t))

procedure fifo-delivery-process(k)
  do forever
    if received (tag, M, s_j, t, last_j) then
      if (last_j ≠ expected_k[j]) then skip
      else
        expected_k[j] := expected_k[j] + 1
        fifo-deliver(tag, M, s_j, t)
  od

Fig. 7. Procedures for crash+link failures
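
The receiving side of Fig. 7 can be sketched as a small class. This is a minimal illustration, assuming senders are identified by integer indices and each message carries the sender's sequence number; the class name `FifoReceiver` and its method are hypothetical.

```python
from typing import List, Tuple


class FifoReceiver:
    """Sketch of fifo-delivery-process at one site."""

    def __init__(self, n_sites: int) -> None:
        self.expected = [0] * n_sites            # expected_k[j] for each sender j
        self.dqueue: List[Tuple[int, str]] = []  # fifo-delivered messages

    def receive(self, sender: int, seq: int, payload: str) -> bool:
        """Fifo-deliver the message only if it carries the expected sequence number."""
        if seq != self.expected[sender]:
            return False                         # skip: an earlier message was lost
        self.expected[sender] += 1
        self.dqueue.append((sender, payload))
        return True
```

Once one message from a sender is missed, the expected counter never advances, so every later message from that sender is skipped. This is how an intermittent loss is turned into a permanent one.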


procedure initialize(k)
  state_k := Rqueue_k := Dqueue_k := ε
  ∀i : Faulty_k[s_i] := false
  current-primary := last-sent_k := ∀j : expected_k[j] := 0

procedure broadcast(M, k)
  time := current-time
  fifo-broadcast(init, M, s_k, time)
  if by time + d fifo-delivered (echo, M, s_j, time)
     for at least n_s - f different j then return
  else stop

procedure deliver(M, k)
  Let M be of the form (tag, s_j, -)
  if tag ∈ {log, mylastlog} then
    if j < current-primary then return
    else
      current-primary := j
      Enq(Rqueue_k, (M, k))

Fig. 8. Procedures for general-omission failures
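
The decision at the end of broadcast in Fig. 8 reduces to counting distinct echoing sites. The following is a minimal sketch, assuming the echoes fifo-delivered by time + d are given as a list of sender indices; the function name `broadcast_outcome` and the string results are illustrative.

```python
from typing import Iterable


def broadcast_outcome(echo_senders: Iterable[int], n_s: int, f: int) -> str:
    """Return 'return' if echoes from at least n_s - f different sites were
    fifo-delivered in time, otherwise 'stop' (the broadcasting site halts)."""
    distinct = len(set(echo_senders))
    return "return" if distinct >= n_s - f else "stop"
```

With n_s = 3 and f = 1, echoes from two different sites suffice, while two echoes from the same site do not.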


We now briefly argue that these procedures satisfy B1-B11. The detailed
proof is omitted from this paper. Had we used the implementation of broadcast
in Fig. 6, B10 (in particular) would be violated because a faulty primary s_j
might omit to send the logs to the backups. Therefore, in Fig. 8, s_j stops in
the broadcast of a response (say r) if fewer than n_s - f sites fifo-deliver and
subsequently fifo-broadcast r. However, even if s_j does not stop in the broadcast,
a faulty (but non-crashed) site s_k might still omit to deliver r, due to a
receive-omission failure, and later become the primary were s_j to fail. To prevent
this, s_k ensures (in procedure failure-detector) that it fifo-delivers some message
(say m') from at least one of the above n_s - f sites that had earlier fifo-broadcast r.
If s_k does not receive such an m', then s_k stops. Now, if s_k omitted to fifo-deliver
r, then by the properties of fifo-broadcast and fifo-deliver, s_k cannot fifo-deliver
m' and would stop (and, therefore, cannot become the primary). Property B6
is similarly satisfied by ensuring that sites detect their own failure to send or
receive alive messages and therefore stop.

These procedures require n_s > 2f, d = 2δ and C = 2δ. Furthermore, we have
developed a protocol for f = 1 (omitted in this paper) that is δ-blocking. Thus,
we establish that all lower bounds for general-omission failures are tight.

As mentioned earlier, the δ-blocking protocol for f = 1 has scenarios in which
the site that receives the request is not the site that responds to the clients. This
is in fact necessary. Define a protocol to be “pass the buck” if in any failure-free
run of the protocol, the site that receives a request is not the site that sends the
corresponding response.
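
The definition can be read as a predicate over one failure-free run. The sketch below is illustrative only: the function name and the encoding of a run as (receiver, responder) pairs are assumptions made here, not notation from the paper.

```python
from typing import Iterable, Tuple


def run_passes_the_buck(run: Iterable[Tuple[str, str]]) -> bool:
    """Check one failure-free run, given as (receiver, responder) pairs.

    Each pair names the site that received a client request and the site
    that sent the corresponding response (hypothetical encoding).
    """
    return all(receiver != responder for receiver, responder in run)
```

A protocol would then be “pass the buck” if this predicate holds for each of its failure-free runs.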


procedure failure-detector(k)
    ∀i,j : A^i_j := (alive, s_j, iτ)
    ∀i,j : F^i_j := (fault, s_j, iτ)
    cobegin
    || for i := 0 to ∞
           when current-time = iτ: fifo-broadcast(init, A^i_k, s_k, iτ)
    || for i := 0 to ∞
           when current-time = iτ + δ:
               ∀j : if not fifo-delivered(init, A^i_j, s_j, iτ) then
                        fifo-broadcast(echo, F^i_j, s_k, iτ)
    || for i := 0 to ∞
           when current-time = iτ + d:
               witness_k[k] := {s_l | fifo-delivered(echo, A^i_k, s_l, iτ)}
               ∀j ≠ k : witness_k[j] := {s_l | fifo-delivered(echo, A^i_j, s_l, iτ) or
                                               fifo-delivered(echo, F^i_j, s_l, iτ)}
               if ∃j : |witness_k[j]| < n_s - f then stop
               if ∃j : not delivered A^i_j then faulty_k[s_j] := true
    coend

Fig.  9.  Procedures  for  general-omission  failures 
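
As a rough illustration of the last branch of failure-detector in Fig. 9, the sketch below performs the end-of-round check on already collected witness sets. The function name, the dictionary layout, and the return values are assumptions made for this example; message delivery and timing are not modelled.

```python
from typing import Dict, Set, Tuple


def end_of_round_check(
    witness: Dict[int, Set[int]],
    delivered_alive: Dict[int, bool],
    n_s: int,
    f: int,
) -> Tuple[str, Set[int]]:
    """Decide what site s_k does at the end of one round.

    witness[j] holds the indices of sites whose echo about s_j was delivered;
    delivered_alive[j] says whether s_j's alive message was delivered.
    Returns ("stop", set()) or ("continue", suspected_sites).
    """
    threshold = n_s - f
    # Too few witnesses for some site: s_k assumes it is faulty itself and stops.
    if any(len(sites) < threshold for sites in witness.values()):
        return "stop", set()
    # Otherwise mark every site whose alive message was not delivered as faulty.
    suspected = {j for j, ok in delivered_alive.items() if not ok}
    return "continue", suspected
```

For example, with n_s = 3 and f = 1 the threshold is 2, so a site that collects only one witness for some s_j stops instead of risking becoming the primary with a stale state.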

Theorem 1. Any C-blocking protocol, where C < 2δ, for send-omission failures
is “pass the buck”.

Proof.  Omitted  in  this  paper.  See  [4].  □ 

8.4  Other  Failure  Models 

The implementations of the procedures for send-omission and receive-omission
failures are similar to those for general-omission failures and so are omitted
from this paper. For receive-omission failures, the lower bound on the degree
of replication and the lower bound on blocking time when n_s ≤ 2f and f > 1
are not tight. Finding optimal protocols remains an open problem. However, the
lower bound on failover time for receive-omission failures, and all lower bounds
for send-omission failures, are tight.

9  A  Surprising  Protocol 

We now describe a δ-blocking protocol tolerating receive-omission failures for
the special case of n_s = 2 and f = 1. This protocol is complex, and so we omit
the detailed description and only outline the protocol’s operation here. This
protocol shows that our lower bound on blocking time when n_s ≤ 2f and f = 1
is tight. The protocol has the odd (yet necessary, as shown in [5]) property that
a non-faulty primary is forced to relinquish to a faulty backup. Furthermore, the
protocol is “pass the buck”. We, however, show that most δ-blocking protocols
tolerating receive-omission failures have to be “pass the buck”.


Informally, let T be the maximum time between any two successive client
requests (possibly from different clients), and let D be such that if some site s
becomes the primary at time t_0 and remains the primary through time t > t_0 + D
when a client i sends a request, then Dest_i = s at time t. We write D < T to
mean that D is bounded and T is either unbounded or bounded and greater
than D. Then

Theorem 2. Any C-blocking protocol, where C < 2δ, for receive-omission
failures with n_s ≤ 2f and D < T is “pass the buck”.

Proof.  Omitted  from  this  paper.  □ 

Whether  a  protocol  has  to  be  “pass  the  buck”  when  the  relation  D  <  T  does 
not  hold  is  an  open  question. 

We now describe the protocol. There are two sites s_0 and s_1. They
communicate with each other using fifo-broadcast and fifo-deliver shown in Fig. 7.
Henceforth, when we say that a site sends a message to the other, we will mean
that the message is sent with fifo-broadcast and the other site receives it with
fifo-deliver.

In  a  failure-free  run  of  this  protocol,  since  the  backup  responds  to  the  client, 
the  primary  forwards  any  response  to  the  backup  (with  a  green  tag  as  we  see 
below)  and  the  backup  sends  this  response  to  the  client.  However,  if  there  is 
a  failure,  then  the  primary  responds  to  the  clients.  In  this  case,  the  primary 
forwards  a  response  to  the  backup  with  a  red  tag.  The  backup  does  not  forward 
a  response  to  the  client  if  the  response  has  a  red  tag. 

Let s_0 initially be the primary. Whenever s_0 receives a request from the
client, it computes a response r, changes state, and sends (green, r) to s_1. Upon
receiving this message, s_1 updates its state, acknowledges to s_0, and then sends
r to the client. Because it is the backup that responds to the client, the protocol
is δ-blocking. Site s_0 processes a new request only after receiving the
acknowledgement from s_1 for the previous request. Finally, s_0 periodically sends
alive messages to s_1, and s_1 acknowledges these messages.
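
The failure-free exchange just described can be sketched as a small trace generator. This is a minimal illustration, not the authors' implementation: the Site class, the function name, and the string form of messages are assumptions introduced here, and message delays and alive messages are not modelled.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Site:
    name: str
    state: List[str] = field(default_factory=list)  # requests applied so far


def failure_free_request(primary: Site, backup: Site, request: str) -> Tuple[str, str, List[str]]:
    """Handle one request with no failures.

    Returns (name of responding site, response, ordered message trace).
    """
    trace: List[str] = []
    response = f"resp({request})"       # primary computes r ...
    primary.state.append(request)       # ... and changes state
    trace.append(f"{primary.name}->{backup.name}: (green, {response})")
    backup.state.append(request)        # backup updates its state
    trace.append(f"{backup.name}->{primary.name}: ack")
    trace.append(f"{backup.name}->client: {response}")  # backup answers the client
    return backup.name, response, trace
```

In this sketch the only inter-site message that precedes the response is (green, r); the acknowledgement gates the next request rather than the current response.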

Suppose that s_0 does not get s_1’s acknowledgement for some message, say,
(green, r) (the argument is similar if no acknowledgement is received for an
alive message). There are three possibilities: (1) s_1 has crashed, (2) s_1 omitted
to receive (green, r) and so did not send the acknowledgement, or (3) s_0 omitted
to receive the acknowledgement. Site s_0 now waits until it is supposed to send
the next alive message. It then sends this alive message and waits for an
acknowledgement. We now consider the above three cases separately.

Case 1: s_1 has crashed. As a result, s_0 does not receive the acknowledgement
to the alive message. Site s_0 continues as the primary. From then on, whenever
s_0 receives a request from the client, it computes the response r, sends (red, r)
to s_1, and then sends the response back to the client. Also, s_0 continues to send
alive messages. Since s_0 is correct, it can continue like this forever.

Case 2: s_1 is faulty and omitted to receive (green, r). By the property of
fifo-broadcast and fifo-deliver, s_1 will not receive the alive messages that s_0
sends. Site s_1 concludes that s_0 has crashed, sends (“s_1 is primary”) to s_0 and
becomes the primary. After that, it behaves like s_0 in case 1 above (including
sending alive messages to s_0). Since s_0 is correct, it receives (“s_1 is primary”)
(as opposed to case 1) and so it becomes the backup. Also, since s_0 is correct it
will not omit to receive (red, r) messages that s_1 sends and so s_0 keeps its state
consistent with s_1. Subsequently, if s_0 stops receiving alive messages from s_1,
then s_1 has crashed and s_0 becomes the primary once again.

Case 3: s_0 is faulty. Since s_1 is correct, it receives the alive message from
s_0, sends the corresponding acknowledgement and remains the backup (as opposed
to case 2). However, by the property of fifo-broadcast and fifo-deliver, s_0 will
not receive this acknowledgement to the alive message (or the (“s_1 is primary”)
message), and so it behaves as in case 1 and continues as the primary. Similar to
case 2, s_1 receives all (red, r) messages that s_0 sends and so its state is
consistent with s_0. Finally, s_1 becomes the primary if it stops receiving alive
messages from s_0.
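
The three cases can be summarised by who ends up as primary after s_0's follow-up alive message. The sketch below is a simplification under stated assumptions: exactly one of the three faults has occurred (f = 1), timing is collapsed into a single step, and the Fault enum and resolve_roles function are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Optional, Tuple


class Fault(Enum):
    S1_CRASHED = 1         # case 1
    S1_RECV_OMISSION = 2   # case 2
    S0_RECV_OMISSION = 3   # case 3


def resolve_roles(fault: Fault) -> Tuple[str, Optional[str]]:
    """Return (primary, backup) once the follow-up alive message is resolved."""
    # s1 takes over only if it is running and misses the alive message.
    s1_running = fault is not Fault.S1_CRASHED
    s1_gets_alive = s1_running and fault is not Fault.S1_RECV_OMISSION
    s1_claims_primary = s1_running and not s1_gets_alive
    # s0 steps down only if it actually receives ("s1 is primary").
    s0_can_receive = fault is not Fault.S0_RECV_OMISSION
    if s1_claims_primary and s0_can_receive:
        return "s1", "s0"
    # Otherwise s0 continues as primary and tags forwarded responses red.
    return "s0", ("s1" if s1_running else None)
```

Note that the correct primary s_0 gives up its role only in case 2, where the site taking over is the faulty one.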

Case 2 in the protocol is the odd scenario in which the correct primary s_0
is forced to relinquish to s_1, which is known to be faulty. However, this scenario
is not peculiar to our protocol. We showed in [5] that relinquishing to a
faulty backup is necessary when n_s ≤ 2f.


10  Discussion 

In [5], we present lower bounds for primary-backup protocols which constrain
the degree of replication, the failover time, and the amount of time it can take
to respond to a client request. In this paper, we derive matching protocols and
show that all except two of these lower bounds are tight. Furthermore, we show
that in some cases the optimal response time can only be obtained if the site
that receives the request is not the site that sends the response to the clients.

We  have  attempted  to  give  a  characterization  of  primary-backup  that  is  broad 
enough  to  include  most  synchronous  protocols  that  are  considered  to  be  instances 
of  the  approach.  There  are  protocols,  however,  that  are  incomparable  to  the  class 
of  protocols  we  analyze  as  these  protocols  were  developed  for  an  asynchronous 
system  [6,9].  We  are  currently  studying  possible  characterizations  for  a  primary- 
backup  protocol  in  an  asynchronous  system  and  expect  to  extend  our  results  to 
this  setting. 